A Japanese Word Dependency Corpus

Authors

  • Shinsuke Mori
  • Hideki Ogura
  • Tetsuro Sasada
Abstract

In this paper, we present a Japanese corpus annotated with dependency relationships. It contains about 30 thousand sentences from various domains. Sentences from six domains of the Balanced Corpus of Contemporary Written Japanese also have part-of-speech and pronunciation annotation. Dictionary example sentences have pronunciation annotation and cover basic Japanese vocabulary, each with an English sentence equivalent. Economic newspaper articles also have pronunciation annotation, and their topics are similar to those of the Penn Treebank. Invention disclosures have no other annotation, but they have a clear application: machine translation. The unit of our corpus is the word, as in corpora for other languages, in contrast to existing Japanese corpora, whose unit is a phrase called a bunsetsu. Each sentence is manually segmented into words. We first present the specification of our corpus. Then we give a detailed explanation of our standard for word dependency. We also report some preliminary results of an MST-based dependency parser on our corpus.
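To make the idea of word-level (rather than bunsetsu-level) dependency concrete, here is a minimal sketch in Python. It is an illustration only: the `Word` class, the `is_valid_tree` helper, and the head indices chosen for the example sentence are assumptions made for this sketch, not the corpus's actual file format or annotation standard.

```python
from dataclasses import dataclass


@dataclass
class Word:
    index: int  # 1-based position of the word in the sentence
    form: str   # surface form of the word
    head: int   # index of the head word; 0 marks the root (hypothetical convention)


def is_valid_tree(words):
    """Return True if exactly one word is the root and every word
    reaches the root by following head links without a cycle."""
    heads = {w.index: w.head for w in words}
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            # A repeated node means a cycle; an unknown node is a dangling head.
            if node in seen or node not in heads:
                return False
            seen.add(node)
            node = heads[node]
    return True


# Illustrative sentence "私 は 本 を 読む" ("I read a book"), one entry per word.
# The head assignments below are an assumption for demonstration purposes.
sentence = [
    Word(1, "私", 2),
    Word(2, "は", 5),
    Word(3, "本", 4),
    Word(4, "を", 5),
    Word(5, "読む", 0),
]

# A deliberately broken structure: words 1 and 2 point at each other.
cyclic = [Word(1, "a", 2), Word(2, "b", 1), Word(3, "c", 0)]
```

Calling `is_valid_tree(sentence)` checks that the word-level heads form a single rooted tree, which is the structural property a dependency parser's output is normally expected to satisfy; `is_valid_tree(cyclic)` rejects the broken example.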


Similar resources

Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and its Application

In Japanese, the syntactic structure of a sentence is generally represented by the relationship between phrasal units, bunsetsus in Japanese, based on a dependency grammar. In many cases, the syntactic structure of a bunsetsu is not considered in syntactic structure annotation. This paper gives the criteria and definitions of dependency relationships between words in a bunsetsu and their applic...


基於非監督式詞義消歧之日語旅遊意見詞翻譯 (Japanese Opinion Word Translation Based on Unsupervised Word Sense Disambiguation in the Travel Domain) [In Chinese]

This paper proposes a Japanese opinion word translation method based on unsupervised word sense disambiguation. The method comprises the corpus preparation, opinion word dictionary construction, and weighting method. Different from the machine translation, our method does not need parallel corpora, tagged corpora or parsing tree banks. Our method is low-cost but effective, and requires a well-m...


Syntactic Reordering in Preprocessing for Japanese → English Translation: MIT System Description for NTCIR-7 Patent Translation Task

We experimented with a well-known technique of training a Japanese English translation system on a Japanese training corpus that has been reordered into an English-like word order. We achieved surprisingly impressive results by naively reordering each Japanese sentence into reverse order. We also developed a reordering algorithm that transforms a Japanese dependency parse into English word order.


Simultaneous English-Japanese Spoken Language Translation Based on Incremental Dependency Parsing and Transfer

This paper proposes a method for incrementally translating English spoken language into Japanese. To realize simultaneous translation between languages with different word order, such as English and Japanese, our method utilizes the feature that the word order of a target language is flexible. To resolve the problem of generating a grammatically incorrect sentence, our method uses dependency st...


Exploiting Headword Dependency and Predictive Clustering for Language Modeling

This paper presents several practical ways of incorporating linguistic structure into language models. A headword detector is first applied to detect the headword of each phrase in a sentence. A permuted headword trigram model (PHTM) is then generated from the annotated corpus. Finally, PHTM is extended to a cluster PHTM (C-PHTM) by defining clusters for similar words in the corpus. We evaluate...




Publication date: 2014